-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make thriftbp.NewBaseplateClientPool()
behave more like the grpcbp
one
#602
Conversation
842275c
to
242517c
Compare
clientpool/channel.go
Outdated
i++ | ||
} else { | ||
lastAttemptErr = err | ||
log.Warnf("clientpool: error creating required client (will retry): %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that I would add a log.Warn
here, I would worry about this causing a lot of log spam if there's an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fishy really wanted a way to make the error clear to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can limit the log spam with a rate limiter:
(it should be at whatever scope you want to collect together, so global if you want to limit the spam at a per process level)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added RL at 2s
addrGen: func() thriftbp.AddressGenerator { | ||
i := 0 | ||
return func() (string, error) { | ||
i += 1 | ||
var err error | ||
if i == 1 { | ||
err = fmt.Errorf("something broke") | ||
} | ||
return ln.Addr().String(), err | ||
} | ||
}(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔕 Is there any other way for us to fake these errors other than hacking them into the AddressGenerator? This isn't really the kind of error we would expect to get since most of these just return a static string (we added it to be compatible with some old ads code).
Not a blocker, but it might be nice to have the errors be closer to errors that we would actually expect/not rely on something that is really just a backwards compatibility hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, that's how every other test I found does it so I used the same approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been swamped, from a cursory review this looks fine. We might want to discuss always trying at least initialConnection times (i.e. not bailing on the first failure) as well, since I think that is an innocuous behavior change, but we can do that separately.
clientpool/channel.go
Outdated
i++ | ||
} else { | ||
lastAttemptErr = err | ||
log.Warnf("clientpool: error creating required client (will retry): %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can limit the log spam with a rate limiter:
(it should be at whatever scope you want to collect together, so global if you want to limit the spam at a per process level)
ea19e8b
to
c0a7191
Compare
c0a7191
to
edad33a
Compare
💸 TL;DR
Changing the behavior of the pool initialization to that it retries to open connections until the required number is opened instead of failing on the first issue.
📜 Details
Prior to this PR the behavior of the thriftbp pool initialization was as followed:
initialConnections
times to open a connection, but bail out at the first error.requiredInitialConnections
then crash.initialConnections
connections opened.This lead to reliability issues for services as the pods sometimes end up starting with very few connections ready. These connections are opened inline with the requests but that adds extra latency and opportunity for failures.
In my opinion this also is a confusing behavior since we end up returning a lower number than configured by the user. It is explicit in the documentation but I still found more than a few people surprised by this behavior.
With this PR applied the new behavior is as followed:
requiredInitialConnections
and retry if necessary until either success or context errorinitialConnections
using the same behavior as before.While not my ideal approach, this allows the user to easily increase the effectively-usable number on
requiredInitialConnections
while leaving theinitialConnection
behavior unchanged.Rollout:
requiredInitialConnections
to higher values than currently.🧪 Testing Steps / Validation
✅ Checks